The Monte Carlo method is a class of algorithms that draw random numbers and compute statistics of the draws to approximate quantities of interest. The method uses randomness even to solve deterministic problems for which deterministic algorithms are too inefficient to be practical. Stanislaw Ulam was one of the researchers who developed the method in the mid-20th century. The method was named after the Monte Carlo casino, where Ulam’s uncle just had to go to gamble. Monte Carlo integration is the most common application of the method.
Recall that, for many functions, analytical integration is impossible, which motivates numerical integration. Deterministic methods such as the trapezoidal rule and Simpson’s rule have their pros and cons: they often work well in one-dimensional problems, but their performance deteriorates rapidly as the dimensionality increases. Monte Carlo (MC) integration is an alternative approach for high-dimensional problems, where it tends to be computationally more efficient than deterministic methods.
The theoretical foundation of MC integration is the law of large numbers. In words, for reasonable \(X\) and \(h\), the sample average converges to the true mean. Or, \[\frac{1}{n}\sum_{i=1}^n h(X_i) \to \mathbb{E}h(X)\] as \(n\to\infty\). How can this help with difficult integration?
Remember that, for continuous \(X\), the expectation of \(h(X)\) is defined as an integral: \[\mathbb{E}h(X) = \int_{\mathcal{S}} h(x)f(x)dx ,\] where \(\mathcal{S}\) is the support of \(X\). Therefore, if we can express a difficult integral as the expectation of some function of a random variable \(X\) from which we can collect lots of samples, we can use the sample average to approximate that integral.
The phrase “we can collect lots of samples” is a catch; it is addressed by sampling methods. But, for now, let’s assume we can.
Suppose we want to integrate \(\phi(x)\) over the region \(\mathcal{X}\). Let \(f\) be the PDF of \(X\) whose support \(\mathcal{S}\) satisfies \(\mathcal{X}\subseteq\mathcal{S}\). Then,
\[\begin{align} \int_\mathcal{X}\phi(x)dx &= \int_\mathcal{S}\phi(x)\mathbb{1}_{\mathcal{X}}(x)dx\\ &= \int_\mathcal{S} \frac{\phi(x)\mathbb{1}_{\mathcal{X}}(x)}{f(x)}f(x)dx\\ &= \mathbb{E}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right] \end{align}\]
We can approximate this expectation by the sample mean estimator \(\hat{M}\): \[\hat{M} = \frac{1}{n}\sum_{i=1}^n \frac{\phi(X_i)\mathbb{1}_{\mathcal{X}}(X_i)}{f(X_i)} \approx \mathbb{E}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right].\]
Note that, under the IID assumption on \(\{X_i\}\), \(\hat{M}\) is an unbiased estimator, meaning \[\mathbb{E}\hat{M} = \mathbb{E}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right].\]
If \(\mathcal{X}=\mathcal{S}\), of course, we can drop \(\mathbb{1}_{\mathcal{X}}\) everywhere and have cleaner expressions.
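As a concrete illustration of \(\hat{M}\) (a minimal sketch; the integrand and distribution are chosen here for illustration): integrate \(\phi(x)=x^2\) over \(\mathcal{X}=[0,1]\), whose true value is \(1/3\), using \(X\sim\text{U}(-1,2)\), so that \(f(x)=1/3\) on \(\mathcal{S}=[-1,2]\supset\mathcal{X}\).

```r
# Sketch of the estimator M-hat: integrate phi(x) = x^2 over [0, 1]
# (true value 1/3) with samples from U(-1, 2), whose PDF is f(x) = 1/3.
n <- 1e5
x <- runif(n, -1, 2)
ind <- (x >= 0) & (x <= 1)      # indicator of membership in [0, 1]
est <- mean(x^2 * ind / (1/3))  # average of phi(x) * 1_X(x) / f(x)
est                             # approximately 1/3
```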
As you may notice, in a sense, the indicator function \(\mathbb{1}_{\mathcal{X}}(x)\) checks \(x\)’s membership in \(\mathcal{X}\) — a member \((1)\) or not \((0)\). Judicious use of indicator function is powerful, allowing us to formulate many different quantities as expectations. For example, probability can be expressed as an expectation: \[ P(X\in \mathcal{X}) = \int_\mathcal{X}f(x)dx = \int_{\mathcal{S}} \mathbb{1}_{\mathcal{X}}(x)f(x)dx = \mathbb{E}\mathbb{1}_{\mathcal{X}}(X).\]
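For instance (a sketch with an arbitrarily chosen example, not from the text above), the tail probability \(P(Z>1.96)\approx 0.025\) for \(Z\sim N(0,1)\) can be estimated as the sample average of an indicator:

```r
# Probability as the expectation of an indicator
n <- 1e5
z <- rnorm(n)            # samples from N(0, 1)
p_hat <- mean(z > 1.96)  # sample average of the indicator 1{z > 1.96}
p_hat                    # close to 1 - pnorm(1.96), i.e., about 0.025
```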
Let’s compare MC integration with Simpson’s rule. A test problem is to integrate \[ \phi(x) = x\sin(x)+5 \enspace\text{over}\enspace \mathcal{X}=[0,3\pi], \] for which we know the answer is \(18\pi \approx 56.5486678\). With only \(n=100\) equally-spaced evaluation points, Simpson’s rule achieves \(56.5486719\).
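A figure like this can be reproduced with a short composite Simpson’s rule (a sketch; here with 100 equal subintervals, hence 101 evaluation points):

```r
phi <- function(x) x*sin(x) + 5

# Composite Simpson's rule with m (even) subintervals
simpson <- function(f, a, b, m = 100) {
  x <- seq(a, b, length.out = m + 1)
  h <- (b - a) / m
  w <- c(1, rep(c(4, 2), m/2 - 1), 4, 1)  # Simpson weights 1,4,2,...,2,4,1
  h/3 * sum(w * f(x))
}

simpson(phi, 0, 3*pi)  # very close to 18*pi
```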
In principle, we may use any continuous distribution provided \(\mathcal{X}\subseteq\mathcal{S}\). We experiment here with \(X_1\sim \text{U}(0,10)\) and \(X_2\sim\text{exp}(1)\), using \(n=10^4\) samples.
First, \(X_1\sim \text{U}(0,10)\) has the support \(\mathcal{S}_1=[0,10]\) and the PDF \(f_1(x)=1/10\) for \(x\in\mathcal{S}_1\) and \(0\) otherwise. Hence, the estimator is \[\hat{M}_1 = \frac{1}{n}\sum_{i=1}^n \frac{\phi(X_{1i})\mathbb{1}_{\mathcal{X}}(X_{1i})}{f_1(X_{1i})} = \frac{1}{n}\sum_{i=1}^n 10\phi(X_{1i})\mathbb{1}_{\mathcal{X}}(X_{1i}).\]
Next, \(X_2\sim\text{exp}(1)\) has the support \(\mathcal{S}_2=[0,\infty)\) and the PDF \(f_2(x)=e^{-x}\) for \(x\in\mathcal{S}_2\) and \(0\) otherwise. Hence, the estimator is \[ \hat{M}_2 = \frac{1}{n}\sum_{i=1}^n \frac{\phi(X_{2i})\mathbb{1}_{\mathcal{X}}(X_{2i})}{f_2(X_{2i})} = \frac{1}{n}\sum_{i=1}^n e^{X_{2i}}\phi(X_{2i})\mathbb{1}_{\mathcal{X}}(X_{2i}). \] Here are the results.
# X1
phi <- function(x) x*sin(x) + 5 # the integrand
n <- 1e4
x <- runif(n, 0, 10)
x <- x[x < 3*pi] # only those in the integration region
10*sum(phi(x))/n
## [1] 56.25112
# X2
x <- rexp(n)
x <- x[x < 3*pi] # only those in the integration region
sum(exp(x)*phi(x))/n
## [1] 40.01199
As you can see, despite the much larger sample size (\(n=10^4\) vs. \(100\)), both results are inferior to Simpson’s rule. In particular, \(\hat{M}_2\) suffers from high variance.
m <- 100
re <- rep(0,m)
for (i in 1:m) {
x <- rexp(n)
x <- x[x < 3*pi] # those in the integration region
re[i] <- sum(exp(x)*phi(x))/n
}
paste("Standard deviation =", round(sd(re),2))
## [1] "Standard deviation = 10.6"
Well, this is expected; what is the point of using an \(X\) with support \([0,\infty)\) to compute the integral of \(\phi(x)\) over \([0,3\pi]\)? Moreover, since \(P(X_2\leq \pi)\approx 0.96\), most samples fall in \([0,\pi]\), and hence the integrand is rarely evaluated over \([\pi,3\pi]\). In cases like this, we cannot expect a good approximation via MC integration. We will come back to this issue, the inefficiency in particular, to motivate importance sampling.
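The figure \(P(X_2\leq\pi)\approx 0.96\) used above follows directly from the exponential CDF:

```r
# CDF of exp(1) at pi: P(X2 <= pi) = 1 - exp(-pi)
pexp(pi)      # about 0.9568
1 - exp(-pi)  # the same value, via the closed-form CDF
```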
To simplify problems and highlight key features, let’s consider the uniform distribution over some rectangle \(\mathcal{S}\subset\mathbb{R}^2\). Recall that the uniform distribution over \(\mathcal{S}\subset\mathbb{R}^d\) is equivalent to having the following PDF of \(X\) \[ f(x) = \begin{cases} \frac{1}{|\mathcal{S}|} & x\in\mathcal{S}\\ 0 & \text{otherwise} \end{cases} \] where \(|\mathcal{S}|\) is the Lebesgue measure of \(\mathcal{S}\). It is a normalisation constant that makes the PDF integrate to \(1\).
A canonical problem is to use MC integration to estimate \(\pi\). First, let \[ \mathcal{X} = \{x\in\mathbb{R}^2 : \|x\|\leq1, x\geq0\},\] a quarter disk in the positive orthant whose area is \(\pi/4\). Using the support \(\mathcal{S} = [0,1]^2\) with \(|\mathcal{S}|=1\), we have \[\begin{align} \frac{\pi}{4} &= \int_\mathcal{X}dx\\ &= \int_\mathcal{S} \mathbb{1}_\mathcal{X}(x)dx\\ &= \int_\mathcal{S} \frac{\mathbb{1}_\mathcal{X}(x)}{f(x)}f(x)dx\\ &= \int_\mathcal{S} \mathbb{1}_\mathcal{X}(x)f(x)dx\\ &= \mathbb{E}\left[\mathbb{1}_{\mathcal{X}}(X)\right] \end{align}\]
The idea is to use MC integration to approximate the last expectation, which in turn tells us \(\pi\).
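Before the animated version below, the estimate itself can be computed in a few vectorized lines (a sketch):

```r
# Vectorized quarter-disk estimate of pi (no plotting)
n <- 1e6
x <- runif(n)
y <- runif(n)
inside <- x^2 + y^2 <= 1  # indicator of membership in the quarter disk
est <- 4 * mean(inside)   # 4 * E[1_X(X)] = pi
est                       # approximately pi
```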
n <- 1e5
x <- c()
y <- c()
for (i in 1:n) {
xy <- runif(2)
if (sum(xy^2)<=1) {
x <- c(x, xy[1])
y <- c(y, xy[2])
}
if (i%%5e3 == 0) {
plot(x, y, xlim=c(0,1), ylim=c(0,1), ann=FALSE, pch=19, cex=0.1)
title(paste("n = ", i, ", ", "π = ", round(4*length(x)/i, 5), sep=""))
}
}
This is nice but rather contrived, because our calculators give us better estimates with just one click. The next example is so much more interesting.
If you are new to the Mandelbrot set, watch this video for a few minutes and enjoy the breathtaking intricacy. Using MC integration, we can estimate the size of such an infinitely intricate object. Amazing, isn’t it?
To use the same approach as above, we must come up with a membership criterion. It is easy for the unit disk \(\mathcal{X} = \{x:\|x\|\leq1\}\), but not here, because we have no such simple expression for the Mandelbrot set. First, define \[ f_c(z) = z^2 + c \] for some \(c\in\mathbb{C}\). Let \(f_c^{(k)}(z)\) be the composition of \(f_c\) with itself \(k\) times, i.e., \[ f_c^{(k)}(z) = (f_c\circ f_c\circ \dots\circ f_c)(z). \] Then, the Mandelbrot set \(\mathcal{M}\) is expressed as: \[ \mathcal{M} = \left\{c\in\mathbb{C} : \lim_{k\to\infty}|f_c^{(k)}(0)|\neq\infty\right\} . \] Since the definition of \(\mathcal{M}\) involves a limit, we cannot directly use it to check membership. Fortunately, it is known that \[ c\in\mathcal{M} \enspace\text{if and only if}\enspace |f_c^{(k)}(0)|\leq 2, \;\forall k\geq 0. \] Although it is impossible to check this condition literally, we can at least disqualify any \(c\) such that \(|f_c^{(k)}(0)|> 2\) for some \(k\leq K\), where \(K\) is our maximum number of iterations. Finally, we are ready to carry out MC integration to estimate the area of the Mandelbrot set! This time, we use the PDF \[ f(x) = \mathbb{1}_{\mathcal{S}}(x)/6 \] where \(\mathcal{S}=[-2,1]\times[-1,1]\).
n <- 1e7
K <- 2000
re <- 3*runif(n)-2
im <- 2*runif(n)-1
c <- complex(real=re, imaginary=im)
z <- c
for (k in 1:K) {
ind <- abs(z)<=2
c <- c[ind]
z <- z[ind]
z <- z^2 + c
}
c <- c[abs(z)<=2]
par(mar=c(3,3,.1,.1))
plot(c, xlim=c(-2,1), ylim=c(-1,1), ann=FALSE, pch=19, cex=0.001)
6*length(c)/n # estimated area = |S| * fraction of points remaining
Our estimate of the area is 1.5073, which seems reasonable according to a more thorough analysis.
(Contrary to the name, importance sampling is NOT a sampling technique; it is a variance reduction technique and some sampling method is assumed to be available.)
We saw this kind of inefficiency above when approximating the integral of \[ \phi(x) = x\sin(x)+5 \enspace\text{over}\enspace \mathcal{X}=[0,3\pi], \] using the sample mean estimator \(\hat{M}\) \[ \hat{M} = \frac{1}{n}\sum_{i=1}^n \frac{\phi(X_i)\mathbb{1}_{\mathcal{X}}(X_i)}{f(X_i)}, \] where, for all \(i\in\{1,\dots,n\}\), \(X_i\sim\text{exp}(1)\) whose support is \(\mathcal{S}=[0,\infty)\). We say an unbiased estimator \(\hat{M}\) is inefficient if \[\begin{align} \operatorname{Var}\hat{M} &= \operatorname{Var}\left[\frac{1}{n}\sum_{i=1}^n \frac{\phi(X_i)\mathbb{1}_{\mathcal{X}}(X_i)}{f(X_i)}\right]\\ &= \frac{1}{n}\operatorname{Var}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right] \end{align}\] is large.
Since \(\operatorname{Var}\hat{M}\) decreases as \(n\) increases, if both sampling from \(X\) and the transformation \(\phi(X)\) are cheap, inefficiency may not be an issue. However, this is not always the case; \(\phi\) may be as complex as a large climate simulation model, and each evaluation of \(\phi(x)\) could take several seconds or longer. In such cases, we cannot afford to simulate millions of times and must come up with a better \(f\) to reduce \(\operatorname{Var}\hat{M}\).
When integrating \(\phi\) over \(\mathcal{X}\), what could be a good proposal distribution \(f\) that leads to small \(\operatorname{Var}\hat{M}\)? Remember, by definition, the more samples we collect away from \(\mathbb{E}\hat{M}\), the larger \(\operatorname{Var}\hat{M}\) is. To find a clue, let’s compute \(\operatorname{Var}\hat{M}\). \[\begin{align} \operatorname{Var}\hat{M} &= \frac{1}{n}\operatorname{Var}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right]\\ &= \frac{1}{n}\left( \mathbb{E}\left[\left(\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right)^2\right] - (\mathbb{E}\hat{M})^2 \right)\\ &= \frac{1}{n}\left( \int_\mathcal{S}\left(\frac{\phi(x)\mathbb{1}_{\mathcal{X}}(x)}{f(x)}\right)^2f(x)dx - (\mathbb{E}\hat{M})^2 \right), \end{align}\] which implies that \(\operatorname{Var}\hat{M}=0\) if \[f(x) = \frac{\phi(x)\mathbb{1}_{\mathcal{X}}(x)}{\mathbb{E}\hat{M}}.\]
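To verify, substitute this \(f\) back: wherever \(f(x)>0\), \[\frac{\phi(x)\mathbb{1}_{\mathcal{X}}(x)}{f(x)} = \mathbb{E}\hat{M},\] a constant, so the squared term integrates to \((\mathbb{E}\hat{M})^2\int_{\mathcal{S}}f(x)dx = (\mathbb{E}\hat{M})^2\), and the variance vanishes.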
Since \(f(x)\geq0, \forall x\), as any PDF must satisfy, we are implicitly assuming \(\phi(x)\geq0, \forall x\). In the more general case where \(\phi(x)<0\) for some \(x\), it can be shown that \(f(x)=c|\phi(x)|\mathbb{1}_{\mathcal{X}}(x)\) minimises \(\operatorname{Var}\hat{M}\), where \(c\) is a normalisation constant.
Although we do not know \(\mathbb{E}\hat{M}\) (which is the goal of this exercise!), since \(f\) as a PDF must integrate to \(1\) anyway, we may absorb \(\mathbb{E}\hat{M}\) into the normalisation constant. Now, we make two observations: a good \(f\) has a shape similar to that of \(|\phi|\) over \(\mathcal{X}\), and a good \(f\) places little mass outside \(\mathcal{X}\), so that few samples are wasted where the integrand contributes nothing.
These explain why \(X_i\sim\text{exp}(1)\) is a bad choice for the above MC integration.
In practice, importance sampling is often used when MC integration by directly sampling from the PDF \(g\) of some expectation \[\mathbb{E}\psi(X) = \int_\mathbb{R}\psi(x)g(x)dx\] is difficult: either sampling from \(g\) itself is hard, or the resulting estimator has high variance because \(\psi\) takes extreme values precisely where \(g\) places little mass.
To connect with the notation, substitute \(\phi(x)\) for \(\psi(x)g(x)\) and let \(\mathcal{X}\) denote the support of \(g\). Then, the problem is again to compute the following: \[\begin{align} \int_\mathcal{X}\phi(x)dx &= \mathbb{E}\psi(X)\\ &= \int_\mathcal{X}\psi(x)g(x)dx\\ &= \int_\mathcal{S}\frac{\psi(x)g(x)\mathbb{1}_{\mathcal{X}}(x)}{f(x)}f(x)dx\\ &= \int_\mathcal{S}\frac{\psi(x)g(x)}{f(x)}f(x)dx\\ &= \mathbb{E}\left[\psi(X)\frac{g(X)}{f(X)}\right], \end{align}\] where \(\frac{g(X)}{f(X)}\) is called the importance weight/ratio. Because we sample from \(f\) while computing an expectation with respect to \(g\), the weight makes the appropriate adjustment, multiplying each \(\psi(x_i)\) by \(g(x_i)/f(x_i)\).
Note that \(\mathbb{1}_{\mathcal{X}}(x)\) is dropped because \(g(x)\) plays the same role, i.e., assigning \(0\) to \(x\notin\mathcal{X}\).
In this special case, technically, we do not need \(\mathcal{X}\subseteq\mathcal{S}\). We just need a weaker condition that \(\psi(x)g(x)=0\) for all \(x\in\mathcal{X}\cap\bar{\mathcal{S}}\).
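As a generic sketch of this identity (the distributions here are illustrative, not from the text): estimate \(\mathbb{E}[X^2]=1\) for \(X\sim N(0,1)\) by sampling from the heavier-tailed proposal \(N(0,1.5^2)\).

```r
# Importance sampling: target g = N(0,1), proposal f = N(0, 1.5^2)
n <- 1e5
x <- rnorm(n, 0, 1.5)             # samples from the proposal f
w <- dnorm(x) / dnorm(x, 0, 1.5)  # importance weights g(x)/f(x)
est <- mean(x^2 * w)              # estimates E[X^2] under g
est                               # approximately 1
```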
Suppose we want to integrate \[ \psi(x)=x^{-\alpha}e^{-x} \enspace\text{over}\enspace \mathcal{X}=(0,1], \] where \(\alpha\in(0,1)\). Note that since we cannot evaluate \(\psi(0)\), it is not so straightforward to apply the trapezoidal rule or Simpson’s rule.
The integral is already in the form of expectation \[\int_0^1x^{-\alpha}e^{-x}dx = \mathbb{E}\left[X^{-\alpha}e^{-X}\right]\] where \(X\sim \text{U}(0,1)\) and the PDF is \(g(x)=\mathbb{1}_{\mathcal{X}}(x)\). So, a naïve MC integration is to use the following estimator \[ \hat{M}=\frac{1}{n}\sum X_i^{-\alpha}e^{-X_i}. \] But, this is problematic because samples near \(x=0\) have disproportionately large values and can dominate \(\hat{M}\). Given the intuitive criteria we made above, we propose a PDF \[ f(x) = \begin{cases} (1-\alpha)x^{-\alpha} & x\in(0,1]\\ 0 & \text{otherwise} \end{cases} \]
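This \(f\) is easy to sample from: its CDF on \((0,1]\) is \[F(x)=\int_0^x(1-\alpha)t^{-\alpha}dt = x^{1-\alpha},\] so \(F^{-1}(u)=u^{1/(1-\alpha)}\), and \(X=U^{1/(1-\alpha)}\) with \(U\sim\text{U}(0,1)\) has PDF \(f\). This is the inverse transform method.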
As you can see below, for \(\alpha=0.5\), the shapes of the two graphs are reasonably similar (and the support of \(f\) is equal to \(\mathcal{X}\)).
alpha <- .5
curve((1-alpha)*x^(-alpha), from=0, to=1, ylab="f(x), ψ(x)")
curve(x^(-alpha)*exp(-x), from=0, to=1, lty=2, add=TRUE)
title("Integrand and proposal distribution")
legend(0.7, 5, lty=c(1,2),
       legend=c(expression(f(x)==0.5*x^-.5), expression(ψ(x)==x^-.5*e^-x)))
Now, our estimator is \[\begin{align} \hat{M} &= \frac{1}{n}\sum_{i=1}^n \psi(X_i)\frac{g(X_i)}{f(X_i)}\\ &= \frac{1}{n}\sum_{i=1}^n \frac{\psi(X_i)}{f(X_i)}\\ &= \frac{1}{n}\sum_{i=1}^n \frac{X_i^{-\alpha}e^{-X_i}}{(1-\alpha)X_i^{-\alpha}}\\ &= \frac{1}{n}\sum_{i=1}^n \frac{e^{-X_i}}{1-\alpha} \end{align}\] Note that, in contrast to the naïve estimator above, \(\frac{e^{-x}}{1-\alpha}\) is bounded on \(\mathcal{X}=(0,1]\). Sampling from \(f\) using the inverse transform method, we get the following result.
alpha <- 0.5
n <- 1e6
u <- runif(n)
x <- u^(1/(1-alpha)) # the inverse transform method
sum(exp(-x)/(1-alpha))/n
## [1] 1.493669
To compare, here is another result obtained using the gammainc function from the pracma package.
gammainc(1,0.5)[1]
## lowinc
## 1.493648
F.Y.I., \(\gamma(x,\alpha)=\int_0^xt^{\alpha-1}e^{-t}dt\) is called the lower incomplete gamma function. Our integral above equals \(\gamma(1,0.5)\); hence the use of gammainc.
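Finally, to see the variance reduction in action, here is a sketch comparing the spread of the naïve estimator and the importance-sampling estimator over repeated runs (the repetition counts are arbitrary):

```r
alpha <- 0.5; n <- 1e4; m <- 200
naive <- is_est <- numeric(m)
for (i in 1:m) {
  u <- runif(n)
  naive[i] <- mean(u^(-alpha) * exp(-u))    # naive: sample directly from U(0,1)
  x <- runif(n)^(1/(1 - alpha))             # inverse transform: X ~ f
  is_est[i] <- mean(exp(-x) / (1 - alpha))  # importance-sampling estimator
}
c(sd(naive), sd(is_est))  # the second is much smaller
```

The naïve estimator’s heavy-tailed summand near \(x=0\) inflates its spread, while the importance-sampling summand is bounded, which is exactly the point of choosing \(f\) to mimic \(\psi\).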